Abstract. Performing effective preference-based data retrieval requires
detailed and preferentially meaningful structured information about the
current user as well as the items under consideration. A common problem is
that representations of items often consist only of technical attributes,
which do not resemble human perception. This is particularly true for
integral items such as movies or songs. It is often claimed that meaningful
item features could be extracted from collaborative rating data, which is
becoming available through social networking services. However, there is
only anecdotal evidence supporting this claim; if it is true, the extracted
information could be very valuable for preference-based data retrieval. In
this paper, we propose a methodology to systematically check this common
claim. We performed a preliminary investigation on a large collection of
movie ratings and present initial evidence.



1 INTRODUCTION 

Recommender systems [1][17] are one of the most prominent applications of
preference handling technology [6] and a highly active area of research. In
particular, fueled by the Netflix competition and its one million dollar
prize money [2], research on collaborative recommendation techniques [21]
has recently made significant advances, most notably through the
introduction of factor models [16][22].

In collaborative recommender systems, users repeatedly express 
their preferences for items, which usually is done by giving explicit 
ratings on some predefined numerical scale. This data can be mod- 
eled using a rating matrix, whose rows correspond to items, columns 
to users, and entries to ratings. Typically, ratings matrices are very 
sparse, that is, only a small fraction of all possible ratings have actu- 
ally been observed. Personalized recommendations are generated by 
predicting unobserved ratings from the available data and, for each 
user, selecting those items considered to be most appealing. 

1 Institut für Informationssysteme, Technische Universität Braunschweig,
Germany

Most state-of-the-art collaborative recommendation methods, including the
winner of the Netflix Prize, are based on factor models, which are known to
yield much more accurate predictions than traditional neighborhood-based
methods [14][15][22][23][24]. In factor models, each user and each item is
represented by a vector in some shared real coordinate space. The vectors
are chosen such that each observed rating is closely approximated by the dot
product of the corresponding item and user vectors. The selection of
coordinates usually is formalized as an optimization problem. Predictions
for unobserved ratings are generated by computing the respective scalar
products. Equivalently, this approach can be seen as a factorization of the
rating matrix into the product of an item matrix (whose rows are the item
vectors) and a user matrix (whose columns are the user vectors).
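To make the dot-product view concrete, the following minimal sketch predicts ratings from a toy item matrix and user matrix. The numbers and the helper name `predict` are illustrative assumptions and do not come from any data set discussed in this paper.

```python
import numpy as np

# Toy item matrix A (3 items x 2 dimensions) and user matrix B
# (2 dimensions x 2 users). All values are made up for illustration.
A = np.array([[1.0, 0.5],
              [0.2, 1.5],
              [2.0, 0.0]])
B = np.array([[2.0, 0.5],
              [1.0, 2.0]])

def predict(A, B, i, u):
    """Predicted rating of item i by user u: row i of A times column u of B."""
    return float(A[i, :] @ B[:, u])

# The matrix product yields all predictions at once (items x users).
R_hat = A @ B
```

Each entry of `R_hat` equals the corresponding call to `predict`, which is exactly the equivalence between scalar products and matrix factorization described above.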

The success of factor models is usually attributed to the intuition that the
coordinate space used to represent items and users actually is a latent
feature space. That is, its dimensions capture the items' perceptual
properties as well as the users' preference judgments regarding these
properties. For example, when items are movies, the individual dimensions
are generally thought to measure (more or less) "obvious" features such as
horror vs. romance, the level of sophistication, or orientation towards
adults. For users, each coordinate is thought to describe the relative
degree of importance attached to the respective dimension. This
understanding of factor models can be found throughout the literature, for
example, in [2][15][16][18][23].

Although it is intuitively appealing, to our knowledge, the correspondence
to features has never been systematically verified, but is only reported
anecdotally. For example, Koren et al. [16] performed a factorization on the
Netflix movie data set and manually interpreted the first two coordinates
for selected movies as follows:

Someone familiar with the movies shown can see clear mean- 
ing in the latent factors. The first factor has on one side lowbrow 
comedies and horror movies, aimed at a male or adolescent au- 
dience, while the other side contains drama or comedy with 
serious undertones and strong female leads. The second factor- 
ization axis has independent, critically acclaimed, quirky films 
on the top, and on the bottom, mainstream formulaic films. 

Further evidence has been provided by Takács et al. [23]. After performing a
factorization of the Netflix data set, they manually assigned labels to
individual dimensions of their coordinate space, such as Legendary, Typical
for men, Romantic, and NOT Monty Python.

In this paper, we propose a systematic method for studying the coordinate
spaces derived from factor models and apply it to the MovieLens 10M data
set, a large real-world collection of movie ratings. The main contribution
of our work consists in laying important groundwork, on which further
research in recommender systems and preference handling can be built. In
particular, we see two concrete directions for future work:

• First, knowing what kind of semantic information is extracted by 
factor models — and how it is represented in coordinate spaces — 
will enable a deeper understanding of these methods. Ultimately, 
these findings may lead to a more systematic development and 
refinement of recommender systems. In particular, a systematic 
assessment of semantic structures provides an additional way of 
evaluating the effectiveness of factor-based recommenders. This 
would perfectly complement traditional evaluation methods [11],
which focus on predictive accuracy. 



• Second, we believe that factor models might be a powerful tool for
automatically extracting meaningful descriptions of otherwise
hard-to-describe items such as movies or songs; in particular, essential
features of movies cannot be characterized at all by purely technical
features such as runtime, language, or release date (footnote 2). But
given a coordinate representation of movies that matches human
perception, the full machinery developed in preference handling research
can be applied [6][9]. For example, clustering techniques can give users
an initial high-level impression of the available items, item rankings
can be learnt from ordinal preference statements [10] or utilities [5],
and the best items can be retrieved by means of Top-k algorithms [12].

Since our primary research interest lies in applying preference- 
based retrieval techniques to item collections, in this paper we will 
concentrate on evaluating the semantic structures contained in the 
item matrix A. Performing a similar analysis of the user matrix B 
may require entirely different methods. 

The paper is structured as follows: After introducing notation and 
reviewing the most important factor models, we develop general 
guidelines on how to evaluate coordinate spaces for semantic infor- 
mation. Then, we illustrate how to apply these guidelines to the eval- 
uation of factor spaces generated from movie rating data and perform 
experiments on the MovieLens 10M data set. 

2 PRELIMINARIES 

In the following, we use the variables $i$ and $j$ to identify items,
whereas $u$ and $v$ denote users. We are dealing with ratings given to $I$
items by $U$ users. Let $R = (r_{i,u}) \in (\mathbb{R} \cup
\{\emptyset\})^{I \times U}$ be the corresponding rating matrix, where
$r_{i,u} = \emptyset$ if item $i$ has not been rated by user $u$; otherwise,
$r_{i,u}$ expresses the strength of user $u$'s preference for item $i$.
Ratings are usually limited to a fixed integer scale (for example, one to
ten stars). Moreover, $\mathcal{R} = \{(i, u) \mid r_{i,u} \neq \emptyset\}$
is the set of all item-user pairs for which ratings are known. Let $n$ be
the total number of ratings observed (the cardinality of $\mathcal{R}$).
Typically, $n$ is very small compared to the number of possible ratings
$I \cdot U$ (for example, in the Netflix data set it is
$\frac{n}{I \cdot U} \approx 1.4\%$).

Given some target dimensionality $d$, the basic idea underlying factor
models is to find matrices $A = (a_{i,r}) \in \mathbb{R}^{I \times d}$ and
$B = (b_{r,u}) \in \mathbb{R}^{d \times U}$ such that their product
$\hat{R} = A \cdot B$ closely resembles $R$ on all known entries. To
quantify this notion of "close resemblance," the sum of squared errors (SSE)
is popularly chosen. The SSE difference between the rating matrix $R$ and
its estimation $\hat{R} = (\hat{r}_{i,u})$ is defined as

$$\mathrm{SSE}(R, \hat{R}) = \sum_{(i,u) \in \mathcal{R}} \left( r_{i,u} - \hat{r}_{i,u} \right)^2 .$$

Factor models are typically formulated as optimization problems 
over A and B, in which the SSE (or some other measure) is to be 
minimized. 
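As a small illustration of the SSE restricted to known entries, the sketch below marks unobserved ratings with NaN. This encoding and the function name `sse` are assumptions made for the example only.

```python
import numpy as np

def sse(R, R_hat):
    """Sum of squared errors over observed entries only.

    Assumption of this sketch: NaN in R marks an unobserved rating.
    """
    observed = ~np.isnan(R)
    diff = R[observed] - R_hat[observed]
    return float(np.sum(diff ** 2))

# Toy rating matrix (3 items x 2 users) with two unobserved entries.
R = np.array([[5.0, np.nan],
              [np.nan, 3.0],
              [1.0, 4.0]])
R_hat = np.array([[4.5, 2.0],
                  [3.0, 3.0],
                  [2.0, 4.0]])
```

Predictions for unobserved entries (such as the value 2.0 in the first row) do not contribute to the error, which mirrors the restriction of the sum to known item-user pairs.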

Probably the most popular factor model is Brandyn Webb's regularized SVD
model [16][18], in which $A$ and $B$ are defined as the solution of the
least squares problem

$$\min_{A, B} \; \mathrm{SSE}(R, A \cdot B) + \lambda \sum_{(i,u) \in \mathcal{R}} \sum_{r=1}^{d} \left( a_{i,r}^2 + b_{r,u}^2 \right) .$$

Here, $\lambda > 0$ is a regularization constant used to avoid overfitting.
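A regularized least squares problem of this kind is commonly attacked with stochastic gradient descent. The following is a minimal sketch under that assumption; the function names, learning rate, initialization, and number of epochs are hypothetical choices and not the settings used in our experiments.

```python
import numpy as np

def regularized_svd(ratings, n_items, n_users, d=2, lam=0.04,
                    lr=0.01, epochs=200, seed=0):
    """Fit A (items x d) and B (d x users) by stochastic gradient descent.

    ratings: list of (item, user, rating) triples for the known entries.
    The constant factor 2 of the gradient is absorbed into the learning rate.
    """
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((n_items, d))
    B = 0.1 * rng.standard_normal((d, n_users))
    for _ in range(epochs):
        for i, u, r in ratings:
            err = r - A[i] @ B[:, u]
            a_old = A[i].copy()  # use the old item vector for the user update
            A[i] += lr * (err * B[:, u] - lam * a_old)
            B[:, u] += lr * (err * a_old - lam * B[:, u])
    return A, B

def objective(ratings, A, B, lam):
    """Squared error plus the regularization term, summed over known ratings."""
    total = 0.0
    for i, u, r in ratings:
        total += (r - A[i] @ B[:, u]) ** 2
        total += lam * (A[i] @ A[i] + B[:, u] @ B[:, u])
    return float(total)
```

Because the penalty is summed over known ratings, vectors of frequently rated items and active users are regularized more strongly, which is why the update applies the `lam` term once per observed rating.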

2 A complementary approach to closing this semantic gap is content-based 
image and video retrieval 151 . 



More advanced versions of the SVD model exclude systematic rating deviations
from the factorization and model them explicitly using new variables. Bell
and Koren [3] propose to estimate rating $r_{i,u}$ by

$$\hat{r}_{i,u} = \mu + \delta_i + \delta_u + \sum_{r=1}^{d} a_{i,r} b_{r,u},$$

where the constant $\mu$ denotes the mean of all observed ratings;
$\delta_i$ and $\delta_u$ are $I + U$ new model parameters expressing
systematic item and user deviations from $\mu$. Again, the parameters are
chosen according to a regularized least squares problem:

$$\min_{A, B, \delta} \; \mathrm{SSE}(R, \hat{R}) + \lambda \sum_{(i,u) \in \mathcal{R}} \left( \sum_{r=1}^{d} \left( a_{i,r}^2 + b_{r,u}^2 \right) + \delta_i^2 + \delta_u^2 \right) .$$

The rationale underlying this approach (which we refer to as δ-SVD in the
following) is that the removal of item- and user-specific general trends
from the factorization allows the factorization to focus on more
sophisticated rating patterns.
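The prediction rule of this model can be sketched in a few lines. The function name and the toy parameter values below are assumptions chosen purely for illustration.

```python
import numpy as np

def predict_delta_svd(mu, delta_item, delta_user, A, B, i, u):
    """Global mean plus item and user deviations plus the factor term."""
    return float(mu + delta_item[i] + delta_user[u] + A[i] @ B[:, u])

# Toy parameters: 2 items, 2 users, 2 dimensions (values are made up).
mu = 3.5
delta_item = np.array([0.5, -0.2])
delta_user = np.array([-0.3, 0.1])
A = np.array([[1.0, 0.0],
              [0.0, 2.0]])
B = np.array([[0.2, 0.4],
              [0.1, -0.1]])
```

With these values, most of each prediction is explained by the mean and the two deviations, so the factor term only has to capture what remains.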

The third basic factor model relevant to our work performs a non-negative
factorization of the rating matrix [23]. It is identical to the regularized
SVD model up to the additional constraint that all entries of $A$ and $B$
must be non-negative. Extending this model by explicit item and user
deviations is not reasonable since this would require negative entries in
$A$ and $B$ to approximate $R$ closely enough. The non-negative matrix
factorization model aims at creating a coordinate space in which effects of
different dimensions on the estimated ratings cannot cancel each other out.
Henceforth, we refer to this model as NNMF.

3 EVALUATING COORDINATE SPACES 

Given an item-feature matrix $A \in \mathbb{R}^{I \times d}$ generated by
some factor model, how can we determine whether the items' coordinates in
this $d$-dimensional space resemble a "semantically meaningful" pattern? The
most straightforward approach consists in extending and systematizing the
casual investigations described in the introduction. This could easily be
done by presenting the item coordinate space to a number of different people
and asking them to label its dimensions. The correspondence between the
generated item coordinates and human perception could, for example, be
assessed by measuring the degree of consensus among people or the average
time needed to come up with adequate labels.

Although this kind of investigation seems very reasonable, it con- 
tains some severe flaws, which cannot be fixed by careful study de- 
sign: 

1. The dimensionality chosen in most applications of factor models
typically ranges between d = 10 and d = 100. A comprehensive 
analysis of the resulting data sets would require the users to com- 
prehend high-dimensional spaces, which is impossible even when 
using advanced visualization techniques. 

2. Due to hindsight bias, given enough time, users will be able to as- 
sign a fitting label to almost any dimension of the coordinate space. 
Chances are good that this effect accounts for rather questionable 
labels such as NOT Monty Python. 

3. By using free association to name dimensions, the collection of
resulting labels tends to show a high variability and reflect individual
differences between users. To produce statistically significant results,
either the sample size must be extended (which requires more study
participants and results in higher costs), or the variability must be
reduced, for example, by training participants to use an established
domain-specific vocabulary to articulate the semantic properties they
recognize in the data (which also increases time and effort).

4. Typically, there are many near-optimal solutions to the above-mentioned
optimization problems, which can be transformed into one another by
rotation (or, more generally, any invertible linear transformation) of
the coordinate axes. This is because, for any invertible matrix
$M \in \mathbb{R}^{d \times d}$, the solution pairs $(A, B)$ and
$(AM, M^{-1}B)$ produce the same SSE. Although regularization usually
enforces the theoretical existence of a unique optimal solution pair, in
practice the enormous problem size often allows only finding one of the
many near-optimal solutions. Consequently, the direction of the
coordinate axes is completely arbitrary, which makes the task of
assigning labels a hopeless undertaking.
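The ambiguity described in item 4 is easy to reproduce numerically. The sketch below uses random toy matrices and an arbitrarily chosen invertible matrix `M`; all values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # toy item matrix, d = 3
B = rng.standard_normal((3, 4))   # toy user matrix

# An arbitrary invertible matrix (its determinant is 7).
M = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])

A2 = A @ M
B2 = np.linalg.solve(M, B)        # computes M^{-1} B without forming the inverse

same_product = bool(np.allclose(A @ B, A2 @ B2))
different_coordinates = not np.allclose(A, A2)
```

Both pairs yield identical predictions and therefore identical SSE, although the item coordinates differ, so individual axes carry no meaning of their own.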

3.1 Some Guidelines 

In this section, we devise a set of guidelines on which to base more 
appropriate approaches to the analysis of coordinate spaces. 

• In view of problems (1) and (4), we recommend avoiding any direct human
interaction with item coordinates. Instead, human input should
concentrate on describing item properties, which in turn are related to
coordinates as well as compared by algorithmic means.

• The only effective way to eliminate hindsight bias (2) is collecting 
feedback on items before generating and presenting any informa- 
tion extracted by the factor models under consideration. 

• To resolve problem (3), we primarily recommend adopting a domain-specific
vocabulary to allow a structured description of items. For example, to
characterize music, the rich vocabulary developed by allmusic (footnote 3)
seems appropriate; amongst others, it includes very detailed information
about genres, styles, moods, and connections between artists. Since this
kind of semantic information can be (or already has been) provided by a
small number of experts and usually is rarely subject to debate, it is
easy to assemble and work with. In later stages of analysis, unrestricted
user feedback may be included to reveal the position and extent of more
fine-grained and rather subjective concepts in the coordinate space.

We also propose to apply a standardization procedure to the generated
coordinate space. This is for the following reasons: First, recall that, for
any invertible matrix $M \in \mathbb{R}^{d \times d}$, the solution pairs
$(A, B)$ and $(AM, M^{-1}B)$ are equivalent; to enable comparisons between
different factor models and even different runs of the same optimization
algorithms, we need to define one solution pair as the standard
representation. Second, to enable a better separation of different effects
in the data, the axes of the item (and user) coordinate space should be
chosen to be orthogonal. Moreover, axes should be ordered according to their
relative importance (measured by the variance of data along each axis); that
is, the first dimension should be assigned to the most important axis.

The perfect tool for matching these requirements is the singular value
decomposition, a well-known matrix factorization technique from linear
algebra, which inspired the SVD factor model. It is based on the fact that,
for any rank-$d$ matrix $X \in \mathbb{R}^{I \times U}$, there is a
column-orthonormal matrix $U \in \mathbb{R}^{I \times d}$, a diagonal matrix
$S \in \mathbb{R}^{d \times d}$, and a row-orthonormal matrix
$V \in \mathbb{R}^{d \times U}$ such that $X = U S V$. By reordering rows
and columns, $S$ can be chosen such that its diagonal elements are ordered
by decreasing magnitude, so that the most important axis comes first.
Moreover, the diagonal matrix $S$ can be eliminated from this factorization
by setting $X = U' V'$, where $U' = U S^{1/2}$ and $V' = S^{1/2} V$. The
matrices $U'$ and $V'$ are unique (up to simultaneous sign changes of
corresponding columns of $U'$ and rows of $V'$) if all diagonal elements of
$S$ are mutually different.

3 http://www.allmusic.com

In our setting, we will apply the singular value decomposition to transform
the product $X = A \cdot B$ into a new product $A' \cdot B'$ as just
described. Since rating data tends to be very "noisy," we can safely assume
that $(A', B')$ is a unique representation of $(A, B)$; we did not encounter
any counterexamples during our experiments on large real-world rating data.
Moreover, any equivalent pair $(AM, M^{-1}B)$ also gets transformed into
$(A', B')$, which we define as the corresponding standard representation. It
can be computed efficiently using the product decomposition algorithm
proposed in [7, Sec. 3].
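For small toy data, the standardization can be sketched with NumPy's dense SVD. This is not the efficient product decomposition algorithm from [7]; forming the full product is only feasible for tiny matrices, and the sign convention used to remove the remaining ambiguity is an assumption of this sketch.

```python
import numpy as np

def standardize(A, B):
    """Map a solution pair (A, B) to a standard pair (A', B') with A' @ B' == A @ B."""
    d = A.shape[1]
    # Dense SVD of the full product; singular values come in descending order.
    U, s, Vt = np.linalg.svd(A @ B, full_matrices=False)
    U, s, Vt = U[:, :d], s[:d], Vt[:d, :]   # keep the d leading singular triplets
    # Assumed sign convention: the largest-magnitude entry of each column of U
    # is made positive, and the matching row of Vt is flipped accordingly.
    rows = np.argmax(np.abs(U), axis=0)
    signs = np.sign(U[rows, np.arange(d)])
    U, Vt = U * signs, Vt * signs[:, None]
    root = np.sqrt(s)                       # split S evenly between both factors
    return U * root, root[:, None] * Vt
```

Applying `standardize` to `(A, B)` and to `(A @ M, np.linalg.solve(M, B))` for an invertible `M` should give the same pair up to rounding, and the columns of the resulting item matrix are orthogonal with norms in descending order.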

3.2 Use Case: Movie Ratings 

Based on these guidelines, we now present a concrete method for 
performing a basic evaluation of coordinate spaces generated from 
movie ratings. Our focus rests on immediate applicability, so we re- 
late the item coordinates to reference data that is already available. 

The reference source for all kinds of movie-related information is IMDb, the
Internet Movie Database, which currently covers about 1.6 million titles.
Most of IMDb's data has been created with the help of its users. Therefore,
a large proportion of the available content can freely be downloaded and
used for non-commercial purposes. Based on this comprehensive data, one
should be able to cross-reference any collection of movie ratings with IMDb.

For the semantic evaluations we are going to perform, the follow- 
ing attributes of titles may prove helpful: genres, certifications (e.g., 
USA:PG for parental guidance suggested), year of release, and plot 
keywords. To illustrate the general procedure, we will only exploit 
genre information in this paper. Extending our method to other types
of semantic information is straightforward. Checking the correspon- 
dence between genres and item coordinates also makes up a good 
first test of whether at least some basic semantic properties of movies 
are represented in coordinate spaces, which is exactly the purpose of 
the current work. 

IMDb recognizes 28 different genres, from Action to Western, 
where each movie may belong to multiple genres. The assignment 
of genres is done by IMDb's expert staff in cooperation with IMDb 
users. To enforce consistency, this process is based upon a collection 
of publicly available guidelines. Therefore, this data source matches
the requirements developed in the previous section. 

To analyze whether the distribution of genres in coordinate space 
displays any significant pattern, we turn to established classification 
algorithms, which explicitly have been designed to exploit any rel- 
evant patterns in the data if there are any. In particular, we propose 
to measure the degree of adherence to a pattern by the classification 
accuracy shown by these algorithms when predicting the genre of 
movies based on their coordinates. In essence, we transform our anal- 
ysis into a sequence of binary classification problems (one for each 
genre), which enables us to build on solid grounds. Following the 
common methodology, we use cross-validation; that is, accuracy is 
measured on a data set, which is independent of the one used to train 
the classifier. By applying proven techniques to counter overfitting, 
our approach also overcomes any possible problems related to hind- 
sight bias. 

For a start, we selected two popular classification algorithms,
which are able to detect different kinds of patterns in the data: support 
vector machines and kNN-classifiers. 

Support vector machines will be used in two different flavors: first, using
a linear kernel (referred to as SVM-lin), and second, using a Gaussian
radial basis function kernel (SVM-RBF). Linear support vector machines will
show a high classification accuracy if most movies of the respective genre
are grouped at one side of the data set, which can be separated from all
remaining movies by a hyperplane. For example, this can be used to test the
hypothesis that there exists a direction in the coordinate space along
which, say, the amount of action increases monotonically. In contrast, the
SVM-RBF classifier detects whether groups of movies with the same genre tend
to be located in close vicinity.

kNN-classifiers perform well if the distance between movies having the same
genre typically is smaller than the distance to movies not having this
genre. Therefore, they can be used to check whether genres form spatially
separated patterns in coordinate space. Since factor models are not based on
a notion of proximity, it is not clear what measure of distance suits factor
models best. We will try out the following four measures: Euclidean
distance, standardized Euclidean distance (where, to ensure equally weighted
dimensions, coordinate values are divided by the standard deviation of the
data with respect to each dimension), negative scalar product (which
essentially adapts the method of rating prediction to measure distance), and
cosine similarity (which is monotonically related to the angle between two
vectors).
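The four measures can be written down compactly. In the sketch below, the function names are hypothetical, and expressing the cosine measure as one minus the cosine similarity is an assumption made so that smaller values mean "closer" for all four functions.

```python
import numpy as np

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

def standardized_euclidean(x, y, std):
    """std holds the per-dimension standard deviation of the item coordinates."""
    return float(np.linalg.norm((x - y) / std))

def negative_scalar_product(x, y):
    """Larger scalar products (higher predicted affinity) give smaller distances."""
    return float(-(x @ y))

def cosine_distance(x, y):
    # Assumption: 1 - cosine similarity is used to obtain a distance-like value.
    return float(1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Note that the negative scalar product is not a metric (it can be negative and depends on vector length), which is one reason why it is unclear in advance which measure fits factor spaces best.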

To evaluate the true benefit of coordinate spaces generated from factor
models, we propose the following baseline, which is derived from traditional
neighborhood-based recommendation methods [20] and constructed as follows:
First, for any items $i$ and $j$, we compute their Pearson correlation
coefficient

$$\varrho_{i,j} = \frac{\sum_{u \in \mathcal{R}_{i,j}} (r_{i,u} - \bar{r}_{i|j}) (r_{j,u} - \bar{r}_{j|i})}{\sqrt{\sum_{u \in \mathcal{R}_{i,j}} (r_{i,u} - \bar{r}_{i|j})^2} \, \sqrt{\sum_{u \in \mathcal{R}_{i,j}} (r_{j,u} - \bar{r}_{j|i})^2}},$$

where $\mathcal{R}_{i,j}$ is the set of all users who rated both $i$ and
$j$, and $\bar{r}_{i|j}$ is the mean rating given to item $i$ by users who
rated both $i$ and $j$. If $\mathcal{R}_{i,j}$ is empty, then
$\varrho_{i,j}$ is undefined. The Pearson correlation coefficient
$\varrho_{i,j}$ measures the tendency of users to rate items $i$ and $j$
similarly. To avoid biased estimates in cases where
$n_{i,j} = |\mathcal{R}_{i,j}|$ is very small, we derive a new measure of
similarity

$$s_{i,j} = \frac{n_{i,j}}{n_{i,j} + \lambda} \, \varrho_{i,j}$$

from $\varrho_{i,j}$ by shrinking towards zero [15]. Here, $\lambda > 0$ is
a regularization parameter. Finally, we carry over these similarities into
distances $d_{i,j}$ by applying a logarithmic transformation.

To derive a $d$-dimensional coordinate space in which items $i$ and $j$
approximately have distance $d_{i,j}$, we use metric multidimensional
scaling [4]. Since neighborhood-based recommendation methods are usually
outperformed by factor models, we expect our baseline coordinate space to be
far inferior to those constructed using factor models. We refer to our
baseline model as MDS.
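The first two steps of the baseline (Pearson correlation on common raters and shrinkage towards zero) can be sketched as follows. The NaN encoding of missing ratings and the function name are assumptions of this example; the subsequent logarithmic transformation and the scaling step are not shown.

```python
import numpy as np

def shrunk_pearson(r_i, r_j, lam=20.0):
    """Shrunk Pearson correlation between two items.

    r_i, r_j: rating vectors over all users, NaN where a user gave no rating.
    Returns None if the correlation is undefined (no common raters, or no
    variation among the common ratings).
    """
    common = ~np.isnan(r_i) & ~np.isnan(r_j)
    n = int(common.sum())
    if n == 0:
        return None
    x = r_i[common] - r_i[common].mean()   # center on the common-rater mean
    y = r_j[common] - r_j[common].mean()
    denom = np.sqrt((x @ x) * (y @ y))
    if denom == 0.0:
        return None
    rho = (x @ y) / denom
    return float(n / (n + lam) * rho)      # shrink towards zero for small n
```

For two items with only three common raters and perfectly correlated ratings, the shrunk value with `lam=20` is 3/23, far below the raw correlation of 1, which illustrates how little trust is placed in small overlaps.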



4 EXPERIMENTS ON MOVIELENS 10M 

We applied our approach to the MovieLens 10M data set (footnote 7), which
consists of about 10 million ratings collected by the online movie
recommender service MovieLens (footnote 8). After postprocessing the
original data (removing one non-existing movie, merging several duplicate
movie entries, and removing movies that received fewer than 20 ratings), our
new data set consists of 9,984,419 ratings of 8938 movies provided by 69878
users. The ratings use a 10-point scale from 0.5 (worst) to 5 (best). Each
user contributed at least 14 ratings.

Our analysis requires the genre information maintained by IMDb, 
so we had to map each movie in the data set to its corresponding 
IMDb entry. This task has been simplified a lot by the fact that 
all items in the MovieLens 10M data set are relatively well-known 
movies developed for cinema (footnote 9). We mapped about 8000 movies
automatically by comparing titles and release years; the remaining
movies have been assigned manually or semi-automatically. 

To avoid the problem of learning from very small samples for now, we did not
use all 28 genres distinguished by IMDb. Instead, we take only those genres
into consideration that have been assigned to at least 5% of all movies in
our data set. Table 1 lists the remaining 13 genres and their relative
frequencies. On average, 2.3 genres have been assigned to each movie.



Genre         %       Genre         %
Action       16.0     Horror       10.1
Adventure    12.7     Mystery       9.1
Comedy       38.2     Romance      25.2
Crime        16.6     Sci-Fi        8.6
Drama        54.6     Thriller     24.2
Family        8.4     War           5.2
Fantasy       8.3

Table 1. Relative frequencies of genres.



4.1 Generating Coordinate Spaces 

We implemented each of the four coordinate extraction methods in MATLAB and
executed them on our rating data.

For SVD, δ-SVD, and NNMF, we followed the literature and used an
optimization procedure based on gradient descent; to reduce computation
time, we applied the Hessian speedup proposed in [19]. Adopting the common
methodology, we chose the regularization parameter $\lambda$ by
cross-validation such that the SSE is minimized on randomly chosen test
sets. We ended up with a value of $\lambda = 0.04$ for each of the three
algorithms.

Since optimization by gradient descent is known to get stuck in 
local extrema of the function to be minimized, we ran the three pro- 
cedures at least three times, each with different initial coordinates, 
which have been chosen randomly. For each result, we computed the 
standardized solution pair as described in the previous section. We 
found that the solutions generated by each extractor do not differ 
significantly after standardization. This indicates that our coordinate 
spaces match the unique solution of each optimization problem. 

For our MDS procedure, we used the regularization constant $\lambda = 20$,
which we determined by adapting the recommendation Koren gave for the
Netflix data set [15]. The coordinates have been generated by MATLAB's
mdscale function using the metric stress criterion. Since in our data set
about 14 percent of all movie-movie pairs had no raters in common, we
treated the respective entries of the distance matrix as missing data.

7 http://www.grouplens.org/node/73
8 http://www.movielens.org
9 This is the reason why we did not consider the Netflix data set. It
consists of all kinds of DVD titles, which often lack a clear
correspondence in IMDb.

To measure the effect of dimensionality, we generated three differ- 
ent coordinate spaces with each extractor by varying the parameter d. 
We chose d = 10, d = 50, and d = 100. 

4.2 Applying the Classifiers 

In total, we used 14 different classifiers to evaluate each of the 12 
coordinate spaces with respect to each of the 13 genres. 

We implemented the two support vector machine classifiers by soft-margin
SVMs with parameters $C = 4$ and (for SVM-RBF) $\gamma = 0.1$, which have
been determined by cross-validation to maximize classification accuracy.

Each of the four different kNN-classifiers will be applied to the data sets
with three different choices of k. To measure whether movies of the same
genre tend to occur in larger groups, we chose k = 1, k = 3, and k = 9. In
the following, we will refer to these 12 classifiers as kNN-Eucl, kNN-sEucl,
kNN-scal, and kNN-cos (with k replaced by 1, 3, or 9).
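A binary kNN-classifier with a pluggable distance measure can be sketched as a simple majority vote. The function name, the toy coordinates, and the labels below are illustrative assumptions; since k is odd, ties cannot occur.

```python
import numpy as np

def knn_predict(train_X, train_y, x, k, dist):
    """Predict whether x has the genre, by majority vote among its k nearest
    training items. train_y holds 1 (has genre) or 0 (does not)."""
    distances = np.array([dist(x, t) for t in train_X])
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = int(train_y[nearest].sum())
    return bool(2 * votes > k)

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

# Toy coordinates: two movies with the genre near the origin, three without.
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
train_y = np.array([1, 1, 0, 0, 0])
```

Passing a different `dist` function swaps the distance measure without touching the voting logic, which is how the four kNN variants differ from each other.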

To enable comparisons among classifiers and data sets, we generated 20 pairs
of training and test sets, each by randomly choosing 40% of all movies for
training and 10% (of the remaining movies) for testing. For each of the
resulting 2184 combinations of coordinate spaces, classifiers, and genres,
we use the same 20 pairs of item sets for training and testing. In each
case, we measured the classification accuracy. All results reported below
are averages over the 20 runs.
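The sampling protocol can be sketched as follows. The function name and seed are hypothetical, and the sketch reads "10% (of the remaining movies)" as one tenth of the movies left after drawing the training set, which is an assumption about the intended meaning.

```python
import numpy as np

def make_splits(n_items, n_pairs=20, train_frac=0.4, test_frac=0.1, seed=0):
    """Return n_pairs disjoint (train_indices, test_indices) pairs."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_pairs):
        perm = rng.permutation(n_items)
        n_train = int(round(train_frac * n_items))
        # Assumed reading: the test set is a fraction of the remaining items.
        n_test = int(round(test_frac * (n_items - n_train)))
        splits.append((perm[:n_train], perm[n_train:n_train + n_test]))
    return splits

# 14 classifiers x 12 coordinate spaces x 13 genres
n_combinations = 14 * 12 * 13
```

Generating the splits once and reusing them for every combination keeps the comparison between classifiers and coordinate spaces paired, so differences are not caused by different random samples.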

4.3 Results 

Probably the most popular way of assessing a classifier's perfor- 
mance is measuring its accuracy, that is, the fraction of test items 
which have been classified correctly. However, in our setting, this 
measure is not very helpful. To see this, recall that the relative fre- 
quency of genres is very different in our data set. For example, over 
half of all movies belong to the genre Drama, but there are only about 
5% War movies. While attaining an accuracy of 95% would be sig- 
nificant for the genre Drama, it can easily be achieved for the genre 
War just by classifying any movie as non-War. To enable compar- 
isons across genres, we propose to use a modified version of Cohen's 
kappa measure. 

Any result of a binary classification task can be described by four numbers,
which sum up to 1: the fraction of true positives ($\alpha_{tp}$), the
fraction of false positives ($\alpha_{fp}$), the fraction of false negatives
($\alpha_{fn}$), and the fraction of true negatives ($\alpha_{tn}$).
Accuracy is defined as $acc = \alpha_{tp} + \alpha_{tn}$. Moreover, the
accuracy of a static majority-based classifier (which always returns the
label of the more frequent class) is
$acc_{maj} = \max\{\alpha_{tp} + \alpha_{fn}, \alpha_{fp} + \alpha_{tn}\}$.
We propose to use this kind of naive classifier for normalizing the accuracy
and define $\kappa = (acc - acc_{maj}) / (1 - acc_{maj})$. This measure
expresses a classifier's relative performance with respect to the
majority-based classifier. If $acc = 1$, then $\kappa = 1$; if
$acc > acc_{maj}$, then $\kappa > 0$; if $acc = acc_{maj}$, then
$\kappa = 0$; and if $acc < acc_{maj}$, then $\kappa < 0$.
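The modified kappa measure translates directly into code. The function name is an assumption, and the example values are made up to mirror the War genre (about 5% positives).

```python
def kappa(tp, fp, fn, tn):
    """Accuracy normalized by the accuracy of the majority-based classifier.

    The four arguments are fractions that sum to 1. The measure is undefined
    (division by zero) if one class makes up the entire test set.
    """
    acc = tp + tn
    acc_maj = max(tp + fn, fp + tn)
    return (acc - acc_maj) / (1.0 - acc_maj)

# Labeling every movie as non-War reaches 95% accuracy but carries no information.
kappa_trivial = kappa(tp=0.0, fp=0.0, fn=0.05, tn=0.95)
# A classifier that finds 60% of all War movies without any false alarms.
kappa_useful = kappa(tp=0.03, fp=0.0, fn=0.02, tn=0.95)
```

The trivial classifier scores 0 and the second one scores 0.6 (up to rounding), although their plain accuracies (95% and 98%) look almost identical.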

By measuring accuracy in terms of $\kappa$, we can average classification
performance over different genres. Tables 2 to 5 report the mean $\kappa$
values over all 260 classification results (13 genres times 20 runs)
obtained for each combination of coordinate space and classifier type. All
entries larger than 0.10 have been marked in boldface. We can observe the
following:





              SVD-10    SVD-50    SVD-100
SVM-lin        0.08      0.18      0.20
SVM-RBF        0.15      0.23      0.25
1NN-Eucl      -0.24     -0.21     -0.19
3NN-Eucl       0.01      0.05      0.04
9NN-Eucl       0.12      0.16      0.14
1NN-sEucl     -0.25     -0.27     -0.31
3NN-sEucl      0.01      0.00     -0.06
9NN-sEucl      0.12      0.12      0.04
1NN-scal      -0.42     -0.30     -0.30
3NN-scal      -0.16     -0.03     -0.03
9NN-scal       0.01      0.11      0.12
1NN-cos       -0.25     -0.18     -0.16
3NN-cos        0.00      0.06      0.06
9NN-cos        0.12      0.17      0.16

Table 2. Kappas for coordinates generated by SVD.




              δ-SVD-10  δ-SVD-50  δ-SVD-100
SVM-lin        0.07      0.16      0.18
SVM-RBF        0.13      0.20      0.23
1NN-Eucl      -0.26     -0.26     -0.26
3NN-Eucl      -0.01      0.01     -0.02
9NN-Eucl       0.11      0.12      0.08
1NN-sEucl     -0.26     -0.29     -0.36
3NN-sEucl      0.00     -0.03     -0.11
9NN-sEucl      0.11      0.09     -0.01
1NN-scal      -0.41     -0.28     -0.22
3NN-scal      -0.06      0.02      0.06
9NN-scal       0.05      0.13      0.16
1NN-cos       -0.26     -0.19     -0.16
3NN-cos        0.00      0.07      0.09
9NN-cos        0.12      0.18      0.19

Table 3. Kappas for coordinates generated by δ-SVD.


              NNMF-10   NNMF-50   NNMF-100
SVM-lin        0.02      0.05      0.11
SVM-RBF        0.02      0.09      0.14
1NN-Eucl      -0.56     -0.47     -0.41
3NN-Eucl      -0.20     -0.16     -0.13
9NN-Eucl      -0.02      0.01      0.02
1NN-sEucl     -0.56     -0.47     -0.45
3NN-sEucl     -0.20     -0.16     -0.16
9NN-sEucl     -0.02      0.01      0.00
1NN-scal      -0.37     -0.34     -0.34
3NN-scal      -0.11     -0.10     -0.09
9NN-scal      -0.02      0.00      0.02
1NN-cos       -0.56     -0.45     -0.41
3NN-cos       -0.20     -0.15     -0.13
9NN-cos       -0.03      0.02      0.03

Table 4. Kappas for coordinates generated by NNMF.




              MDS-10    MDS-50    MDS-100
SVM-lin       -0.16      0.15      0.19
SVM-RBF        0.03      0.16      0.17
1NN-Eucl      -0.29     -0.19     -0.18
3NN-Eucl      -0.01      0.06      0.06
9NN-Eucl       0.13      0.18      0.18
1NN-sEucl     -0.29     -0.23     -0.29
3NN-sEucl     -0.01      0.05     -0.01
9NN-sEucl      0.13      0.17      0.12
1NN-scal      -0.29     -0.19     -0.18
3NN-scal      -0.01      0.07      0.08
9NN-scal       0.12      0.18      0.18
1NN-cos       -0.28     -0.18     -0.16
3NN-cos        0.00      0.07      0.08
9NN-cos        0.13      0.19      0.19

Table 5. Kappas for coordinates generated by MDS.



• The coordinate space derived by NNMF does not contain much 
helpful information about genres that can be exploited by our clas- 
sifiers. The performance in all other spaces is significantly better. 

• Except for NN-sEucl, classification performance generally improves with
increasing dimensionality. However, the difference in performance between
d = 10 and d = 50 is much larger than the one between d = 50 and d = 100.
This indicates that our ordering of dimensions during standardization
indeed captures some notion of relative importance. This is probably also
the reason for NN-sEucl's decreasing performance with growing d; treating
all dimensions equally seems to overweight information from dimensions at
the end of the list.

• The SVM-RBF classifier slightly outperforms SVM-lin, but is 
comparable in performance to 9NN-Eucl, 9NN-scal, and 9NN- 
cos. This indicates that genres indeed tend to cluster in coordinate 
spaces, even with respect to different measures of distance. 

• The NN-classifiers display bad performance for k = 1 and k = 3, which
indicates that, although movies of the same genre roughly occur in
clusters, each cluster usually also contains movies that have not been
assigned the respective genre.

• In contrast to our expectations, the performance in coordinate 
spaces generated by factor models is comparable to the perfor- 
mance shown on our baseline coordinate space MDS. 

Moreover, the results suggest that the performance of kNN-classifiers might
increase even further for larger values of k. To check this, we performed
some preliminary tests with k ≈ 20, but have not been able to confirm this
conjecture.

We also investigated the influence of individual genres on classification
performance; as an example, the results for SVM-RBF are reported in Table 6.
Entries larger than 0.20 have been indicated. We can see that some genres,
such as Horror and Drama, can clearly be identified by the classifier, while
others cannot. We had expected much better performance on clear-cut genres
such as War.





              SVD-100   δ-SVD-100  NNMF-100  MDS-100
Action         0.34      0.31       0.22      0.22
Adventure      0.13      0.12       0.08      0.00
Comedy         0.45      0.42       0.25      0.42
Crime          0.08      0.06      -0.01      0.00
Drama          0.47      0.43       0.37      0.44
Family         0.43      0.46       0.31      0.34
Fantasy        0.03      0.05       0.01      0.00
Horror         0.56      0.54       0.31      0.61
Mystery        0.06      0.04      -0.00      0.00
Romance        0.11      0.10      -0.00      0.00
Sci-Fi         0.23      0.20       0.09      0.00
Thriller       0.31      0.27       0.14      0.15
War            0.05      0.06      -0.00      0.00

Table 6. Kappas for SVM-RBF by genre.



In summary, these preliminary experiments suggest that the coordinate spaces
derived by SVD, δ-SVD, and MDS indeed contain some significant semantic
information about the represented movies. However, the situation is far less
clear than the literature claims.

5 CONCLUSION AND OUTLOOK 

In the current paper, we presented a general methodology for systematically
analyzing whether coordinate spaces generated from factor models contain
semantic information, as is commonly claimed. We applied our approach to the
MovieLens 10M data set and found initial evidence for this claim.

Our results encourage us to follow this line of research in several ways.
First, we would like to investigate whether our results also carry over to
more advanced and complex factor models, which have been proposed very
recently [13][15]. It would also be interesting to see what more traditional
methods such as multidimensional scaling can contribute to the problem of
feature extraction from rating data, since our results indicate that these
methods can successfully be modified for use in our new setting.